ABSTRACT

A new geometrically-motivated algorithm for nonnegative
matrix factorization is developed and applied to the
discovery of latent "topics" for text and image "document"
corpora. The algorithm is based on robustly finding and
clustering extreme-points of empirical cross-document
word-frequencies that correspond to novel "words" unique to
each topic. In contrast to related approaches that are based on
solving non-convex optimization problems using suboptimal
approximations, locally-optimal methods, or heuristics, the
new algorithm is convex, has polynomial complexity, and has
competitive qualitative and quantitative performance
compared to the current state-of-the-art approaches on synthetic
and real-world datasets.

Index Terms — Topic modeling, nonnegative matrix fac- 
torization (NMF), extreme points, subspace clustering. 

1. INTRODUCTION 

Topic modeling is a statistical tool for the automatic
discovery and comprehension of latent thematic structure or
topics, assumed to pervade a corpus of documents.

Suppose that we have a corpus of M documents composed of
words from a vocabulary of W distinct words indexed by
w = 1, ..., W. In the classic "bag of words" modeling
paradigm widely used in Probabilistic Latent Semantic
Analysis [1] and Latent Dirichlet Allocation (LDA) [2],
each document is modeled as being generated by N independent
and identically distributed (iid) drawings of words from
an unknown W x 1 document word-distribution vector. Each
document word-distribution vector is itself modeled as an
unknown probabilistic mixture of K < min(M, W) unknown
W x 1 latent topic word-distribution vectors that are shared
among the M documents in the corpus. The goal of topic
modeling then is to estimate the latent topic word-distribution
vectors and possibly the topic mixing weights for each
document from the empirical word-frequency vectors of all
documents. Topic modeling has also been applied to various types
of data other than text, e.g., images, videos (with photometric
and spatio-temporal feature-vectors interpreted as the words),
genetic sequences, hyper-spectral images, voice, and music,
for signal separation and blind deconvolution.

If $\beta$ denotes the unknown W x K topic-matrix whose
columns are the K latent topic word-distribution vectors and
$\theta$ denotes the K x M weight-matrix whose M columns are
the mixing weights over K topics for the M documents, then
each column of the W x M matrix $A = \beta\theta$ corresponds
to a document word-distribution vector. Let X denote the
observed W x M words-by-documents matrix whose M
columns are the empirical word-frequency vectors of the
M documents when each document is generated by N iid
drawings of words from the corresponding column of the A
matrix. Then given only X and K, the goal is to estimate the
topic matrix $\beta$ and possibly the weight-matrix $\theta$. This can
be formulated as a nonnegative matrix factorization (NMF)
problem [4], [5], [6], [7] where the typical solution strategy is to
minimize a cost function of the form



$$\|X - \beta\theta\|_F^2 + \psi(\beta, \theta) \qquad (1)$$



where the regularization term $\psi$ is introduced to enforce
desirable properties in the solution such as uniqueness of the
factorization, sparsity, etc. The joint optimization of (1) with
respect to $(\beta, \theta)$ is, however, non-convex and necessitates the
use of suboptimal strategies such as alternating minimization,
greedy gradient descent, local search, approximations, and
heuristics. These are also typically sensitive to small sample
sizes (words per document) N, especially when $N \ll W$,
because many words may not be sampled and X may be far
from A in Euclidean distance. In LDA, the columns of $\beta$ and
$\theta$ are modeled as iid random drawings from Dirichlet prior
distributions. The resulting maximum a posteriori probability
estimation of $(\beta, \theta)$, however, turns out to be a fairly complex
non-convex problem. One then takes recourse to sub-optimal
solutions based on variational Bayes approximations of the
posterior distribution and other methods based on Gibbs
sampling and expectation propagation.
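
To make the small-sample effect concrete, the following is a minimal sketch (an illustration only, not part of the proposed method) that draws N iid words from one hypothetical document word-distribution and reports how many vocabulary words are never sampled and how far the empirical frequencies are from the true distribution in Euclidean distance. The vocabulary size, the Dirichlet draw used for the distribution, and the seeds are arbitrary assumptions.

```python
import numpy as np

def empirical_frequencies(a, n_words, rng):
    """Draw n_words iid words from distribution a; return word frequencies."""
    counts = rng.multinomial(n_words, a)
    return counts / n_words

rng = np.random.default_rng(0)
W = 500                                   # vocabulary size (assumed)
a = rng.dirichlet(np.ones(W))             # hypothetical column of A

for n_words in (50, 5000, 500000):
    x = empirical_frequencies(a, n_words, rng)
    unseen = int(np.sum(x == 0))          # words that were never sampled
    print(n_words, unseen, np.linalg.norm(x - a))
```

When N is much smaller than W, most entries of the empirical vector are zero, which is the regime in which the text notes that X may be far from A.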

In contrast to these approaches, we adopt the nonnegative
matrix factorization framework and propose a new geometrically
motivated algorithm that has competitive performance
compared to the current state of the art and is free of
heuristics and approximations.

2. A NEW GEOMETRIC APPROACH 

A key ingredient of the new approach is the so-called
"separability" assumption introduced in [5] to ensure the uniqueness
of nonnegative matrix factorization. Applied to $\beta$, this means
that each topic contains "novel" words which appear only in
that topic, a property that has been found to hold in the
estimates of topic matrices produced by several algorithms [8].
More precisely, a W x K topic matrix $\beta$ is separable if for
each $k \in [1, K]$, there exists a row of $\beta$ that has a single
nonzero entry which is in the k-th column. Figure 1 shows an
example of a separable topic matrix with three topics. Words
1 and 2 are unique (novel) to topic 1, words 3, 4 to topic 2,
and word 5 to topic 3.

Let $C_k$ be the set of novel words of topic k for $k \in [1, K]$
and let $C_0$ be the remaining words in the vocabulary. Let $A_w$
and $\theta_k$ denote the w-th and k-th row-vectors of A and $\theta$
respectively. Observe that all the row-vectors of A that
correspond to the novel words of the same topic are just different
scaled versions of the same $\theta$ row-vector: for each $w \in C_k$,
$A_w = \beta_{wk}\theta_k$. Thus if $\bar{A}$, $\bar{\beta}$, and $\bar{\theta}$ denote the row-normalized
versions (i.e., unit row sums) of A, $\beta$, and $\theta$ respectively, then
$\bar{A} = \bar{\beta}\bar{\theta}$ and for all $w \in C_k$, $\bar{A}_w = \bar{\theta}_k$ (e.g., in Fig. 1,
$\bar{A}_1 = \bar{A}_2 = \bar{\theta}_1$ and $\bar{A}_3 = \bar{A}_4 = \bar{\theta}_2$), and for all $w \in C_0$,
$\bar{A}_w$ lives in the convex hull of the $\bar{\theta}_k$'s (in Fig. 1, $\bar{A}_6$ is in the
convex hull of $\bar{\theta}_1$, $\bar{\theta}_2$, $\bar{\theta}_3$).

Fig. 1. A separable topic matrix and the underlying geometric
structure. Solid circles represent rows of $\bar{A}$, empty circles
represent rows of $\bar{X}$.
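
The geometric picture can be checked numerically. The sketch below is an illustration under assumed values: the 6 x 3 topic matrix follows the word pattern described for Fig. 1 (words 1-2 novel to topic 1, words 3-4 to topic 2, word 5 to topic 3, word 6 non-novel), but its entries, the number of documents, and the Dirichlet draw for $\theta$ are arbitrary. In this sketch the convex weights of the non-novel word are obtained from its row of $\beta$ rescaled by the row sums of $\theta$.

```python
import numpy as np

def row_normalize(m):
    """Scale each row to unit sum."""
    return m / m.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Separable, column-stochastic topic matrix (entry values are assumed).
beta = np.array([[0.3, 0.0, 0.0],
                 [0.2, 0.0, 0.0],
                 [0.0, 0.4, 0.0],
                 [0.0, 0.3, 0.0],
                 [0.0, 0.0, 0.6],
                 [0.5, 0.3, 0.4]])
theta = rng.dirichlet(0.5 * np.ones(3), size=8).T    # K x M weights, M = 8
A = beta @ theta
A_bar, theta_bar = row_normalize(A), row_normalize(theta)

# Rows of novel words coincide with the normalized rows of theta.
print(np.allclose(A_bar[0], theta_bar[0]), np.allclose(A_bar[1], theta_bar[0]))

# The non-novel word (row index 5) is a convex combination of them.
s = theta.sum(axis=1)                                 # row sums of theta
c = beta[5] * s / (beta[5] * s).sum()                 # convex weights
print(c, np.allclose(A_bar[5], c @ theta_bar))
```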

This geometric viewpoint reveals how to extract the topic
matrix $\beta$ from A: (1) Row-normalize A to $\bar{A}$. (2) Find
extreme points of $\bar{A}$'s row-vectors. (3) Cluster the row-vectors
of $\bar{A}$ that correspond to the same extreme point into the same
group. There will be K disjoint groups and each group will
correspond to the novel words of the same topic. (4) Express
the remaining row-vectors of $\bar{A}$ as convex combinations of
the extreme points. This gives us $\bar{\beta}$. (5) Finally, renormalize $\bar{\beta}$
to obtain $\beta$.
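
The five idealized steps can be sketched for a noise-free A as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: extreme points are located with random linear projections, steps (3) and (4) are merged into one least-squares solve (adequate only because the data are exact and the extreme points are assumed linearly independent), and the function name, tolerances, and the small example matrix are hypothetical.

```python
import numpy as np

def recover_topics_from_exact(A, K, n_proj=200, tol=1e-9, seed=0):
    """Idealized steps (1)-(5) on a noise-free W x M matrix A = beta @ theta."""
    rng = np.random.default_rng(seed)
    row_sums = A.sum(axis=1)
    A_bar = A / row_sums[:, None]                       # (1) row-normalize

    # (2) The maximizer/minimizer of a random linear functional over a
    # finite point set is an extreme point; keep the distinct ones.
    extremes = []
    for _ in range(n_proj):
        proj = A_bar @ rng.standard_normal(A.shape[1])
        for i in (int(np.argmax(proj)), int(np.argmin(proj))):
            if all(np.abs(A_bar[i] - e).max() > tol for e in extremes):
                extremes.append(A_bar[i])
    if len(extremes) != K:
        raise ValueError("expected K distinct extreme points")
    V = np.array(extremes)                              # K x M vertices

    # (3)+(4) Weights of every row on the vertices. Novel words receive a
    # one-hot weight vector; other words receive their convex weights.
    C = np.linalg.lstsq(V.T, A_bar.T, rcond=None)[0].T  # W x K
    C = np.clip(C, 0.0, None)

    # (5) Undo the row normalization, then make each column sum to one.
    beta = C * row_sums[:, None]
    return beta / beta.sum(axis=0, keepdims=True)

beta_true = np.array([[0.3, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.4, 0.0],
                      [0.0, 0.3, 0.0], [0.0, 0.0, 0.6], [0.5, 0.3, 0.4]])
theta = np.random.default_rng(1).dirichlet(0.5 * np.ones(3), size=8).T
beta_est = recover_topics_from_exact(beta_true @ theta, K=3)
print(np.round(beta_est, 3))
```

The recovered columns match the true topics only up to a permutation, since the order in which extreme points are discovered is arbitrary.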

The reality, however, is that we only have access to X, not
A. The above algorithm when applied to X would work well
if X is close to A, which would happen if N is large. When N
is small, two problems arise: (i) Points corresponding to novel
words of the same topic may become multiple extreme points
and may be far from each other (e.g., $\bar{X}_1$, $\bar{X}_2$ and $\bar{X}_3$, $\bar{X}_4$
in Fig. 1). (ii) Points in the convex hull may also become
"outlier" extreme points (e.g., $\bar{X}_6$ in Fig. 1).

As a step towards overcoming these difficulties, we observe
that in practice, the unique words of any topic only
occur in a few documents. This implies that the rows of $\theta$
are sparse and that the row-vectors of X corresponding to
the novel words of the same topic are likely to form a
low-dimensional subspace (e.g., $S_1$, $S_2$ in Fig. 1) since their
supports are subsets of the supports of the same row-vector of
$\theta$. If we make the further assumption that for any pair of
distinct topics there are several documents in which their novel
words do not co-occur, then the row subspaces of X
corresponding to the novel words of any two distinct topics are likely
to be significantly disjoint (although they might share a
common low-dimensional subspace). Finally, the row-vectors of
X corresponding to non-novel words are unlikely to be close
to the row subspaces of X corresponding to the novel words
of any one topic (e.g., $\bar{X}_6$ in Fig. 1). These observations and
assumptions motivate the revised 5-step Algorithm 1 for
extracting $\beta$ from X.

Algorithm 1 Topic Discovery 

Input: W x M word-document matrix X; number of topics K.
Output: Estimate $\hat{\beta}$ of the W x K topic matrix $\beta$.

1: Row-normalize X to get $\bar{X}$. Let $N_w := \sum_{d=1}^{M} X_{wd}$.

2: Apply Algorithm 2 to the rows of $\bar{X}$ to obtain a subset of
rows $\mathcal{E}$ that correspond to candidate novel words. Let $C_0$
be the remaining row indices.

3: Apply the sparse subspace clustering algorithm of [9], [10]
to $\mathcal{E}$ with parameters $\lambda_1$, $\gamma$ to obtain K clusters $\{C_k\}_{k=1}^{K}$
of novel words and a cluster $C_{out}$ of outlier words.
Rearrange the rows of $\bar{X}$ indexed by $C_k$ into a matrix $Y_k$.

4: For each $w \in C_0 \cup C_{out}$, solve

for some $\lambda_2 > 0$. Let $\{b_{wk}\}_{k=1}^{K}$ be the optimal solution.

5: For w = 1, ..., W, k = 1, ..., K, set

$$\hat{\beta}_{wk} = \begin{cases} N_w \, \mathbb{1}(w \in C_k) & \text{for } w \in \bigcup_{l=1}^{K} C_l \\ N_w \, \|b_{wk}\|_1 & \text{for } w \in C_0 \cup C_{out} \end{cases}$$

and normalize each column of $\hat{\beta}$ to be column stochastic.



Algorithm 2 Find candidate novel words 

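Input: Set of 1 x M probability row-vectors $x_1, \ldots, x_W$;
Number of projections P; Tolerance $\delta$.
Output: Set $\mathcal{E}$ of candidate novel row-vectors.

1: Set $\mathcal{E} = \emptyset$.
2: Generate row-vector $d \sim \mathrm{Uniform}(\text{unit-sphere in } \mathbb{R}^M)$.
3: $i_{\max} := \arg\max_i x_i d^T$, $i_{\min} := \arg\min_i x_i d^T$.
4: $\mathcal{E} \leftarrow \mathcal{E} \cup \{x_i : \|x_i - x_{i_{\max}}\|_1 \le \delta \text{ or } \|x_i - x_{i_{\min}}\|_1 \le \delta\}$.
5: Repeat steps 2 through 4, P times.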
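
The random-projection search of Algorithm 2 can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function name and arguments are hypothetical, the closeness test uses the l1 norm as an assumption of this sketch, and the function returns row indices instead of the row-vectors themselves. The small test matrix (three extreme rows, one duplicate, three interior mixtures) is made up for the example.

```python
import numpy as np

def find_candidate_novel_words(X_bar, n_proj, delta, seed=0):
    """Sorted row indices of X_bar within delta (l1) of a projected extreme."""
    rng = np.random.default_rng(seed)
    W, M = X_bar.shape
    candidates = set()                                    # step 1
    for _ in range(n_proj):                               # step 5
        d = rng.standard_normal(M)
        d /= np.linalg.norm(d)                            # step 2: uniform on unit sphere
        proj = X_bar @ d
        for i_ext in (np.argmax(proj), np.argmin(proj)):  # step 3
            dist = np.abs(X_bar - X_bar[i_ext]).sum(axis=1)
            candidates.update(np.flatnonzero(dist <= delta).tolist())  # step 4
    return sorted(candidates)

rng = np.random.default_rng(1)
V = rng.dirichlet(np.ones(10), size=3)             # 3 hypothetical extreme rows
mix = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [1 / 3, 1 / 3, 1 / 3]])
X_bar = np.vstack([V[0], V[0], V[1], V[2], mix @ V])
candidate_rows = find_candidate_novel_words(X_bar, n_proj=200, delta=1e-6)
print(candidate_rows)
```

Each projection costs one matrix-vector product plus one pass over the rows, which is consistent with the linear complexity per projection noted in the text.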



Step (2) of Algorithm 1 finds rows of $\bar{X}$, many of which
are likely to correspond to the novel words of topics and some
to outliers (non-novel words). This step uses Algorithm 2,
which is a linear-complexity procedure for finding, with high
probability, extreme points and points close to them (the
candidate novel words of topics) using a small number P of
random projections. Step (3) uses the state-of-the-art sparse
subspace clustering algorithm from [9], [10] to identify K
clusters of novel words, one for each topic, and an additional
cluster containing the outliers (non-novel words). Step (4)
expresses rows of $\bar{X}$ corresponding to non-novel words as
convex combinations of these K groups of rows and step
(5) estimates the entries in the topic matrix and normalizes
it to make it column-stochastic. In many applications,
non-novel words occur in only a few topics. The group-sparsity
penalty $\lambda_2 \sum_{l=1}^{K} \|b_{wl}\|_\infty$ proposed in [11] is used in step (4)
of Algorithm 1 to favor solutions where the row vectors of
non-novel words are convex combinations of as few groups
of novel words as possible. Our proposed algorithm runs in
polynomial time in W, M, and K and all the optimization
problems involved are convex.
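
As an illustration of step (5), the sketch below assembles the topic-matrix estimate from the row sums $N_w$, the novel-word clusters, and the weight vectors returned by step (4), following our reading of the step as "row sum times indicator" for novel words and "row sum times l1 norm of the weight vector" for the remaining words. The function name, the data layout, and the toy numbers are assumptions made for the example; the optimization of step (4) itself is not reproduced here.

```python
import numpy as np

def assemble_topic_matrix(row_counts, clusters, weights):
    """Step 5 sketch.

    row_counts: length-W array of N_w (row sums of X).
    clusters:   list of K index lists, the novel words C_k.
    weights:    dict mapping a non-novel/outlier word w to a list of K
                nonnegative weight arrays b_wk (one per cluster).
    """
    W, K = len(row_counts), len(clusters)
    beta_hat = np.zeros((W, K))
    for k, members in enumerate(clusters):
        beta_hat[members, k] = row_counts[members]        # N_w * 1(w in C_k)
    for w, b_w in weights.items():
        for k in range(K):
            beta_hat[w, k] = row_counts[w] * np.abs(b_w[k]).sum()  # N_w * ||b_wk||_1
    return beta_hat / beta_hat.sum(axis=0, keepdims=True)  # column stochastic

row_counts = np.array([30.0, 10.0, 20.0, 40.0])
clusters = [[0, 1], [2]]                                   # two toy topics
weights = {3: [np.array([0.2, 0.3]), np.array([0.5])]}     # word 3 is non-novel
beta_hat = assemble_topic_matrix(row_counts, clusters, weights)
print(beta_hat)
```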

3. EXPERIMENTAL RESULTS 
3.1. Synthetic Dataset 

Fig. 2. Error of estimated topic matrix in Frobenius norm.
Upper: W = 500, $\rho$ = 0.2, N = 50, K = 5; Lower: W = 500,
$\rho$ = 0.2, K = 10, M = 500.

In this section, we validate our algorithm on some
synthetic examples. We generate a W x K separable topic
matrix $\beta$ with $W_1/K > 1$ novel words per topic as follows: first,
iid 1 x K row-vectors corresponding to non-novel words are
generated uniformly on the probability simplex. Then, $W_1$ iid
Uniform[0, 1] values are generated for the nonzero entries in
the rows of novel words. The resulting matrix is then
column-normalized to get one realization of $\beta$. Let $\rho := W_1/W$.
Next, M iid K x 1 column-vectors are generated for the $\theta$
matrix according to a Dirichlet prior $c \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$. Following
[12], we set $\alpha_i = 0.1$ for all i. Finally, we obtain X by
generating N iid words for each document.
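
The generation procedure described above can be sketched as follows. This is a minimal illustration under assumptions: the function name is hypothetical, novel words are placed in the first rows of the matrix for convenience, the number of novel words is rounded so that each topic receives the same count, and X is returned as empirical frequencies (counts divided by N).

```python
import numpy as np

def make_synthetic_corpus(W, K, M, N, rho, alpha=0.1, seed=0):
    """Return (beta, theta, X) following the synthetic setup sketched in the text."""
    rng = np.random.default_rng(seed)
    per_topic = round(rho * W) // K            # novel words per topic
    W1 = per_topic * K                         # total number of novel words
    beta = np.zeros((W, K))
    # Novel words: a single Uniform[0, 1] entry in the owning topic's column.
    for k in range(K):
        rows = slice(k * per_topic, (k + 1) * per_topic)
        beta[rows, k] = rng.uniform(0.0, 1.0, size=per_topic)
    # Non-novel words: rows drawn uniformly on the probability simplex.
    beta[W1:] = rng.dirichlet(np.ones(K), size=W - W1)
    beta /= beta.sum(axis=0, keepdims=True)    # column-normalize
    # Mixing weights: one Dirichlet(alpha, ..., alpha) column per document.
    theta = rng.dirichlet(alpha * np.ones(K), size=M).T    # K x M
    A = beta @ theta
    # N iid words per document, stored as empirical word frequencies.
    X = np.stack([rng.multinomial(N, A[:, d]) for d in range(M)], axis=1) / N
    return beta, theta, X

beta, theta, X = make_synthetic_corpus(W=50, K=5, M=20, N=50, rho=0.2)
print(beta.shape, theta.shape, X.shape)
```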

For different settings of W, $\rho$, K, M and N, we calculate
the error of the estimated topic matrix $\hat{\beta}$ as $\|\hat{\beta} - \beta\|_F$. For
each setting we average the error over 50 random samples.
In sparse subspace clustering the value of $\lambda_1$ is set as in [10]
(it depends on the size of the candidate set) and the value of
$\gamma$ as in [9] (it depends on the values of N, M). In Step 4 of
Algorithm 1 we set $\lambda_2 = 0.01$ for all settings.

We compare our algorithm against the LDA algorithm [2]
and a state-of-the-art NMF-based algorithm [13]. This NMF
algorithm is chosen because it compensates for the type of noise
we use in our topic model. Our LDA algorithm uses Gibbs
sampling for inference. Figure 2 depicts the estimation
error as a function of the number of documents M (top) and the
number of words per document N (bottom). Evidently, our
algorithm is uniformly better than the comparable techniques over
the tested ranges. Specifically, while NMF has error similar to
that of our algorithm for large M, it performs relatively poorly
as a function of N. On the other hand, LDA has error
performance similar to ours for large N but performs poorly as a
function of M. Note that both of these algorithms have
comparably high error rates for small M and N.

3.2. Swimmer Image Dataset 

Fig. 3. (a) Example "clean" images (columns of A) in Swimmer
dataset; (b) Corresponding images with sampling "noise"
(columns of X); (c) Examples of ideal topics (columns of $\beta$).

In this section we apply our algorithm to the synthetic
swimmer image dataset introduced in [5]. There are M = 256
binary images, each of W = 32 x 32 = 1024 pixels. Each
image represents a swimmer composed of four limbs, each of
which can be in one of 4 distinct positions, and a torso.

We interpret pixel positions (i, j), $1 \le i, j \le 32$, as words
in a dictionary. Documents are images, where an image is
interpreted as a collection of pixel positions with non-zero
values. Since each of the four limbs can independently take
one of four positions, it turns out that the topic matrix $\beta$
satisfies the separability assumption with K = 16 "ground truth"
topics that correspond to 16 single limb positions. Following
the setting of [10], we set body pixel values to 10 and
background pixel values to 1. We then take each "clean" image,
suitably normalized, as an underlying distribution across
pixels and generate a "noisy" document of N = 200 iid "words"
according to the topic model. Examples are shown in Fig. 3.
We then apply our algorithm to the "noisy" dataset. We again
compare our algorithm against LDA and the NMF algorithm
from [13]. Results are shown in Figures 4 and 5. Values of
tuning parameters $\lambda_1$, $\gamma$, and $\lambda_2$ are set as in Sec. 3.1.
Specifically, $\lambda_1 = 0.1$, $\lambda_2 = 0.01$ for the results in Figs. 4 and 5.

Fig. 4. Topics estimated for noisy swimmer dataset by a) proposed algorithm, b) LDA inference using code in [12], c) NMF
algorithm using code in [13]. Topics closest to the 16 ideal (ground truth) topics LA1, LA2, etc., are shown. LDA misses 5 and
NMF misses 6 of the ground truth topics while our algorithm recovers all 16 and our topic estimates look less noisy.

Fig. 5. Topic errors for (a) LDA algorithm [12] and (b) NMF
algorithm [13] on the Swimmer dataset. Figure depicts topics
that are extracted by LDA and NMF but are not close to any
"ground truth" topic. The ground truth topics correspond to
16 different positions of left/right arms and legs.

This dataset is a good validation test for different
algorithms since the ground truth topics are known and are unique.
As we see in Fig. 5, both LDA and NMF produce topics that
do not correspond to any pure left/right arm/leg positions.
Indeed, many estimated topics are composed of multiple limbs.
In contrast, no such errors arise with our algorithm, and
our topic estimates are closer to the ground truth images.

3.3. Text Corpora 

In this section, we apply our algorithm to two different text
corpora, namely, the NIPS dataset [14] and the New York (NY)
Times dataset [15]. In the NIPS dataset, there are M = 2484
documents with W = 14036 words in the vocabulary. There
are, on average, $N \approx 900$ words in each document. In the
NY Times dataset, M = 3000, W = 9340, and $N \approx 270$.
The vocabulary is obtained by deleting a standard "stop" word
list used in computational linguistics, including numbers,
individual characters, and some common English words such as
"the". Words that occur fewer than 5 times in the dataset and
words that occur in fewer than 5 documents are removed from
the vocabulary as well. The tuning parameters $\lambda_1$, $\gamma$, and $\lambda_2$
are set in the same way as in Sec. 3.1 (specifically, $\lambda_1 = 0.1$
and $\lambda_2 = 0.1$).

Table 1 (body only partially recoverable). Topic labels: "law",
"market", "game". Surviving entries: game, law, election,
campaign, charge, run, court, business, season.

Table 1. Most frequent words in examples of estimated topics.
Upper: NIPS, with K = 40 topics; Lower: NY Times,
with K = 20 topics.
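
The vocabulary filtering described in this subsection can be sketched as below. This is an illustrative helper, not the preprocessing code used for the experiments: the function name, the tokenized input format, and the toy stop list are assumptions, and the stop-word set is expected to be supplied by the caller.

```python
from collections import Counter

def build_vocabulary(docs, stop_words, min_count=5, min_docs=5):
    """docs: list of token lists. Returns the sorted retained vocabulary."""
    total = Counter()      # occurrences over the whole corpus
    doc_freq = Counter()   # number of documents containing the word
    for tokens in docs:
        total.update(tokens)
        doc_freq.update(set(tokens))
    return sorted(
        w for w in total
        if w not in stop_words
        and total[w] >= min_count      # drop words seen fewer than 5 times
        and doc_freq[w] >= min_docs    # drop words in fewer than 5 documents
    )

docs = [["the", "topic", "model"]] * 5 + [["rare", "topic"]]
vocab = build_vocabulary(docs, stop_words={"the"})
print(vocab)
```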

Table 1 depicts typical topics extracted by our algorithm.
For each topic we show its most frequent words, listed in
descending order of estimated probability. Although there is no
"ground truth" to compare with, the most frequent words in
the estimated topics do form recognizable themes. For
example, in the NIPS dataset, the set of (most frequent) words
"chip", "circuit", etc., can be annotated as "IC Design"; the
words "visual", "cells", etc., can be labeled as "human visual
system". As a point of comparison, we also experimented
with related convex programming algorithms [8], [7] that have
recently appeared in the literature. We found that they fail to
produce meaningful results for these datasets.